3.7.3 Bayesian Pruning
After binarizing CNNs, we prune the resulting 1-bit CNNs under the same Bayesian learning framework. Different channels may follow similar distributions, and channels with similar distributions are combined for pruning. From a mathematical perspective, we obtain a Bayesian formulation of BNN pruning by directly extending our basic idea in [78], which systematically yields compact 1-bit CNNs. We represent the kernel weights of the $l$-th layer as a tensor $K^l \in \mathbb{R}^{C^l_o \times C^l_i \times H^l \times W^l}$, where $C^l_o$ and $C^l_i$ denote the numbers of output and input channels, respectively, and $H^l$ and $W^l$ are the height and width of the kernels, respectively. For clarity, we define
\[
K^l = [K^l_1, K^l_2, \ldots, K^l_{C^l_o}], \tag{3.104}
\]
where $K^l_i$, $i = 1, 2, \ldots, C^l_o$, is a 3-dimensional filter in $\mathbb{R}^{C^l_i \times H^l \times W^l}$. For simplicity, $l$ is omitted in the remainder of this section. To prune 1-bit CNNs, we assimilate similar filters into a single one through a controlled learning process. To do this, we first divide the filters of $K$ into groups using the K-means algorithm and then replace the filters of each group by their average during optimization. This process assumes that the $K_i$'s in the same group follow the same Gaussian distribution during training. The pruning problem then becomes how to find the average $\bar{K}$ that replaces all the $K_i$'s following that distribution, which leads to a problem similar to Eq. 3.99. It should be noted that learning under a Gaussian distribution constraint is widely considered, e.g., in [82].
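The grouping step can be illustrated with a short sketch. The following NumPy/scikit-learn snippet is only an illustration, not the implementation used in the experiments: it assumes the layer's latent full-precision kernels are available as a $(C_o, C_i, H, W)$ array, flattens each filter, clusters the filters with K-means, and replaces every filter in a group by the group mean $\bar{K}$; the helper names (`group_filters`, `replace_by_group_means`) and the number of groups are hypothetical choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def group_filters(weights, num_groups):
    """Cluster the C_o filters of one layer with K-means.

    `weights` is assumed to be a (C_o, C_i, H, W) array of latent
    full-precision kernels; each filter K_i is flattened before clustering.
    """
    c_o = weights.shape[0]
    flat = weights.reshape(c_o, -1)                # one row per filter K_i
    labels = KMeans(n_clusters=num_groups, n_init=10).fit_predict(flat)
    return labels

def replace_by_group_means(weights, labels):
    """Replace every filter by the mean (the shared center) of its group."""
    pruned = weights.copy()
    for g in np.unique(labels):
        idx = np.where(labels == g)[0]
        pruned[idx] = weights[idx].mean(axis=0)    # broadcast group mean to members
    return pruned

# Toy usage: 16 filters of shape (8, 3, 3), grouped into 4 clusters.
w = np.random.randn(16, 8, 3, 3).astype(np.float32)
labels = group_filters(w, num_groups=4)
w_shared = replace_by_group_means(w, labels)
```

After training, filters sharing one center can be collapsed into a single filter, which is the sense in which the grouping prunes the layer.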
Accordingly, Bayesian learning is used to prune 1-bit CNNs. We denote by $\epsilon$ the difference between a filter and its mean, i.e., $\epsilon = K - \bar{K}$, and assume for simplicity that it follows a Gaussian distribution. To calculate $\bar{K}$, we minimize $\epsilon$ based on MAP in our Bayesian framework, and we have
\[
\bar{K} = \arg\max_{\bar{K}} \; p(\bar{K}|\epsilon) = \arg\max_{\bar{K}} \; p(\epsilon|\bar{K})\,p(\bar{K}), \tag{3.105}
\]
\[
p(\epsilon|\bar{K}) \propto \exp\!\Bigl(-\frac{1}{2\nu}\|\epsilon\|_2^2\Bigr) \propto \exp\!\Bigl(-\frac{1}{2\nu}\|K - \bar{K}\|_2^2\Bigr), \tag{3.106}
\]
and $p(\bar{K})$ is similar to Eq. 3.101 but with one mode. Taking the negative logarithm of the posterior and dropping constant terms, we have
\[
\min_{\bar{K}} \; \|K - \bar{K}\|_2^2 + \nu\,(K - \bar{K})^T \Psi^{-1} (K - \bar{K}) + \nu \log\bigl(\det(\Psi)\bigr), \tag{3.107}
\]
which is called the Bayesian pruning loss. In summary, our Bayesian pruning addresses the problem in a general way: similar kernels are assumed to follow a Gaussian distribution and are finally represented by their centers for pruning. From this viewpoint, we obtain a pruning method that is better suited to binary neural networks than existing ones. Moreover, we take the latent distributions of kernel weights, features, and filters into consideration within the same framework and introduce Bayesian losses and Bayesian pruning to improve the capacity of 1-bit CNNs. Comparative experimental results on model pruning also demonstrate the superiority of our BONNs [287] over existing pruning methods.
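For concreteness, the pruning loss in Eq. 3.107 can be written as a differentiable penalty added to training. The PyTorch snippet below is a minimal sketch under added assumptions: $\Psi$ is restricted to a diagonal covariance parameterized by its log-diagonal, each filter is flattened to a vector, the group centers are held fixed within the step, and `nu`, the helper name `bayesian_pruning_loss`, and the toy shapes are illustrative rather than taken from the original implementation.

```python
import torch

def bayesian_pruning_loss(filters, centers, log_psi_diag, nu=1e-4):
    """Sketch of Eq. (3.107) with a diagonal Psi (an assumption made here).

    filters:      (C_o, D) flattened filters K_i of one layer
    centers:      (C_o, D) the group mean assigned to each filter
    log_psi_diag: (D,) log of the diagonal entries of Psi (learned jointly)
    nu:           weighting hyper-parameter (illustrative value)
    """
    diff = filters - centers                        # K - K_bar
    recon = (diff ** 2).sum()                       # ||K - K_bar||_2^2
    psi_inv = torch.exp(-log_psi_diag)              # Psi^{-1} for diagonal Psi
    mahala = ((diff ** 2) * psi_inv).sum()          # (K - K_bar)^T Psi^{-1} (K - K_bar)
    log_det = log_psi_diag.sum()                    # log det(Psi)
    return recon + nu * mahala + nu * log_det

# Toy usage: 16 filters of dimension 72 (= 8*3*3) shared among 4 groups.
filters = torch.randn(16, 72, requires_grad=True)
labels = torch.arange(16) % 4                       # fixed toy group assignment
group_means = torch.stack([filters[labels == g].mean(0) for g in range(4)])
centers = group_means[labels].detach()              # treat centers as fixed targets here
log_psi = torch.zeros(72, requires_grad=True)
loss = bayesian_pruning_loss(filters, centers, log_psi)
loss.backward()
```

In practice this term would be weighted against the task loss, and the centers would be refreshed as the grouping is updated; the detached centers above are a simplification for the sketch.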
3.7.4 BONNs
We employ the three Bayesian losses to optimize 1-bit CNNs, which form our Bayesian
Optimized 1-bit CNNs (BONNs). To do this, we reformulate the first two Bayesian losses